5 research outputs found
Feature Learning and Signal Propagation in Deep Neural Networks
Recent work by Baratin et al. (2021) sheds light on an intriguing pattern
that occurs during the training of deep neural networks: some layers align much
more with the data than other layers (where alignment is defined as the
Euclidean product of the tangent feature matrix and the data label matrix).
The curve of the alignment as a function of layer index (generally) exhibits an
ascent-descent pattern where the maximum is reached for some hidden layer. In
this work, we provide the first explanation for this phenomenon. We introduce
the Equilibrium Hypothesis which connects this alignment pattern to signal
propagation in deep neural networks. Our experiments demonstrate an excellent
match with the theoretical predictions.
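To make the alignment measure concrete, here is a minimal sketch (in PyTorch) of a layer-wise alignment for a small fully-connected network. It treats alignment as the normalised Frobenius inner product between each layer's tangent kernel J_l J_l^T (per-example gradients of the scalar output with respect to that layer's parameters) and the label Gram matrix y y^T; the normalisation, architecture, and random data are illustrative assumptions, not the paper's setup, and the alignment is measured here at initialisation, whereas the ascent-descent pattern is claimed to emerge over training:

    import torch
    import torch.nn as nn

    torch.manual_seed(0)

    # Toy data: 64 points, 10 features, labels in {-1, +1} (arbitrary choices).
    X = torch.randn(64, 10)
    y = torch.randn(64, 1).sign()

    net = nn.Sequential(
        nn.Linear(10, 32), nn.ReLU(),
        nn.Linear(32, 32), nn.ReLU(),
        nn.Linear(32, 32), nn.ReLU(),
        nn.Linear(32, 1),
    )

    def layer_alignments(net, X, y):
        """Normalised inner product <K_l, y y^T> / (||K_l|| ||y y^T||) per layer,
        where K_l = J_l J_l^T and J_l stacks per-example gradients of the scalar
        output with respect to layer l's parameters (its tangent features)."""
        out = net(X)                                   # (N, 1) scalar outputs
        yyT = y @ y.t()
        scores = []
        for layer in net:
            params = list(layer.parameters())
            if not params:                             # skip the ReLUs
                continue
            rows = []
            for i in range(X.shape[0]):
                g = torch.autograd.grad(out[i, 0], params, retain_graph=True)
                rows.append(torch.cat([t.flatten() for t in g]))
            J = torch.stack(rows)                      # (N, #params in layer)
            K = J @ J.t()                              # layer tangent kernel
            scores.append(((K * yyT).sum() / (K.norm() * yyT.norm())).item())
        return scores

    # One alignment value per parameterised layer, ordered by layer index.
    print(layer_alignments(net, X, y))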
Do deep neural networks have an inbuilt Occam's razor?
The remarkable performance of overparameterized deep neural networks (DNNs)
must arise from an interplay between network architecture, training algorithms,
and structure in the data. To disentangle these three components, we apply a
Bayesian picture, based on the functions expressed by a DNN, to supervised
learning. The prior over functions is determined by the network, and is varied
by exploiting a transition between ordered and chaotic regimes. For Boolean
function classification, we approximate the likelihood using the error spectrum
of functions on data. When combined with the prior, this accurately predicts
the posterior, measured for DNNs trained with stochastic gradient descent. This
analysis reveals that structured data, combined with an intrinsic Occam's
razor-like inductive bias towards (Kolmogorov) simple functions that is strong
enough to counteract the exponential growth of the number of functions with
complexity, is a key to the success of DNNs.
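As a rough illustration of this Bayesian picture (not the paper's actual procedure), the sketch below estimates a prior over Boolean functions by randomly sampling the parameters of a small ReLU network, applies a zero/one likelihood on a training subset of {0,1}^n, and combines the two into a posterior over functions. The network width, weight variances, target function, and sample sizes are all arbitrary choices made for illustration:

    import itertools
    import numpy as np

    rng = np.random.default_rng(0)
    n = 5                                           # 2^n = 32 input points
    inputs = np.array(list(itertools.product([0, 1], repeat=n)), dtype=float)

    def random_function(width=64):
        """Boolean function computed by a one-hidden-layer ReLU net at random
        initialisation, encoded as its outputs on all 2^n points."""
        W1 = rng.normal(0.0, np.sqrt(2.0 / n), size=(n, width))
        W2 = rng.normal(0.0, np.sqrt(2.0 / width), size=(width, 1))
        h = np.maximum(inputs @ W1, 0.0)
        return tuple((h @ W2 > 0).astype(int).ravel())

    # Prior P(f): frequency of each function under random sampling of parameters.
    prior = {}
    for _ in range(20000):
        f = random_function()
        prior[f] = prior.get(f, 0) + 1

    # Likelihood P(S|f): 1 if f agrees with the target on the training points, else 0.
    target = inputs[:, 0].astype(int)               # toy target: copy the first bit
    train_idx = rng.choice(2 ** n, size=16, replace=False)
    def likelihood(f):
        return float(all(f[i] == target[i] for i in train_idx))

    # Posterior P(f|S) proportional to P(S|f) P(f).
    posterior = {f: c * likelihood(f) for f, c in prior.items()}
    Z = sum(posterior.values()) or 1.0
    for f, p in sorted(posterior.items(), key=lambda kv: -kv[1])[:5]:
        print(round(p / Z, 3), "".join(map(str, f)))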
Is SGD a Bayesian sampler? Well, almost
Overparameterised deep neural networks (DNNs) are highly expressive and so
can, in principle, generate almost any function that fits a training dataset
with zero error. The vast majority of these functions will perform poorly on
unseen data, and yet in practice DNNs often generalise remarkably well. This
success suggests that a trained DNN must have a strong inductive bias towards
functions with low generalisation error. Here we empirically investigate this
inductive bias by calculating, for a range of architectures and datasets, the
probability P_{SGD}(f|S) that an overparameterised DNN, trained with
stochastic gradient descent (SGD) or one of its variants, converges on a
function f consistent with a training set S. We also use Gaussian processes
to estimate the Bayesian posterior probability P_B(f|S) that the DNN
expresses f upon random sampling of its parameters, conditioned on S.
Our main findings are that P_{SGD}(f|S) correlates remarkably well with
P_B(f|S) and that P_B(f|S) is strongly biased towards low-error and
low-complexity functions. These results imply that strong inductive bias in the
parameter-function map (which determines P_B(f|S)), rather than a special
property of SGD, is the primary explanation for why DNNs generalise so well in
the overparameterised regime.
While our results suggest that the Bayesian posterior P_B(f|S) is the
first-order determinant of P_{SGD}(f|S), there remain second-order
differences that are sensitive to hyperparameter tuning. A function probability
picture, based on P_{SGD}(f|S) and/or P_B(f|S), can shed new light
on the way that variations in architecture or hyperparameter settings such as
batch size, learning rate, and optimiser choice affect DNN performance.
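A toy version of this comparison can be run directly. The sketch below trains a small network many times with SGD on a few Boolean examples and records the frequency of each behaviour on the held-out points (an empirical stand-in for P_{SGD}(f|S)), then compares it with the frequency obtained by rejection-sampling randomly initialised networks that happen to fit the training set (a crude stand-in for the Gaussian-process estimate of P_B(f|S) used in the paper). The architecture, learning rate, sample counts, and target function are illustrative assumptions:

    import itertools
    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    n = 3
    X = torch.tensor(list(itertools.product([0., 1.], repeat=n)))
    y = X[:, :1] * 2 - 1                      # toy target in {-1, +1}: the first bit
    perm = torch.randperm(2 ** n)
    train_idx, test_idx = perm[:4], perm[4:]  # 4 training points, 4 held-out points

    def fresh_net():
        return nn.Sequential(nn.Linear(n, 16), nn.ReLU(), nn.Linear(16, 1))

    def behaviour(net, idx):
        with torch.no_grad():
            return tuple((net(X[idx]) > 0).int().ravel().tolist())

    def fits_train(net):
        return behaviour(net, train_idx) == tuple((y[train_idx] > 0).int().ravel().tolist())

    # Empirical P_SGD(f|S): frequency of each held-out behaviour over SGD runs.
    p_sgd = {}
    for _ in range(100):
        net = fresh_net()
        opt = torch.optim.SGD(net.parameters(), lr=0.2)
        for _ in range(1000):
            opt.zero_grad()
            ((net(X[train_idx]) - y[train_idx]) ** 2).mean().backward()
            opt.step()
        if fits_train(net):
            f = behaviour(net, test_idx)
            p_sgd[f] = p_sgd.get(f, 0) + 1

    # Crude stand-in for P_B(f|S): rejection sampling at random initialisation.
    p_b = {}
    for _ in range(20000):
        net = fresh_net()
        if fits_train(net):
            f = behaviour(net, test_idx)
            p_b[f] = p_b.get(f, 0) + 1

    for f in sorted(set(p_sgd) | set(p_b)):
        print(f, "SGD runs:", p_sgd.get(f, 0), "random inits:", p_b.get(f, 0))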
Automatic Gradient Descent: Deep Learning without Hyperparameters
The architecture of a deep neural network is defined explicitly in terms of
the number of layers, the width of each layer and the general network topology.
Existing optimisation frameworks neglect this information in favour of implicit
architectural information (e.g. second-order methods) or architecture-agnostic
distance functions (e.g. mirror descent). Meanwhile, the most popular optimiser
in practice, Adam, is based on heuristics. This paper builds a new framework
for deriving optimisation algorithms that explicitly leverage neural
architecture. The theory extends mirror descent to non-convex composite
objective functions: the idea is to transform a Bregman divergence to account
for the non-linear structure of neural architecture. Working through the
details for deep fully-connected networks yields automatic gradient descent: a
first-order optimiser without any hyperparameters. Automatic gradient descent
trains both fully-connected and convolutional networks out-of-the-box and at
ImageNet scale. A PyTorch implementation is available at
https://github.com/jxbz/agd and also in Appendix B. Overall, the paper supplies
a rigorous theoretical foundation for a next generation of
architecture-dependent optimisers that work automatically and without
hyperparameters.
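The authors' linked implementation is the reference. Purely to illustrate what a layer-aware update with no tunable learning rate can look like, here is a LARS-style sketch in which each weight matrix moves along its normalised gradient by a fixed fraction of its own Frobenius norm, divided by depth. The relative step of 0.1 below is a hand-picked stand-in for the step size that the paper derives from its Bregman-divergence analysis; this is not the automatic gradient descent update rule:

    import torch
    import torch.nn as nn

    def layerwise_relative_step(model, loss, rel=0.1):
        """Move each weight matrix along its normalised gradient by
        rel * ||W||_F / depth. Illustrative only: the paper derives the step
        size from the architecture rather than fixing `rel` by hand
        (see https://github.com/jxbz/agd for the authors' optimiser)."""
        weights = [p for p in model.parameters() if p.dim() >= 2]
        grads = torch.autograd.grad(loss, weights)
        with torch.no_grad():
            for w, g in zip(weights, grads):
                if g.norm() > 0:
                    w -= (rel * w.norm() / len(weights)) * g / g.norm()

    # Toy regression run with no optimiser object and no learning-rate tuning.
    torch.manual_seed(0)
    net = nn.Sequential(nn.Linear(20, 64, bias=False), nn.ReLU(),
                        nn.Linear(64, 64, bias=False), nn.ReLU(),
                        nn.Linear(64, 1, bias=False))
    X, y = torch.randn(256, 20), torch.randn(256, 1)
    for step in range(200):
        loss = ((net(X) - y) ** 2).mean()
        layerwise_relative_step(net, loss)
        if step % 50 == 0:
            print(step, loss.item())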
Neural networks are a priori biased towards Boolean functions with low entropy
Understanding the inductive bias of neural networks is critical to explaining
their ability to generalise. Here, for one of the simplest neural networks -- a
single-layer perceptron with n input neurons, one output neuron, and no
threshold bias term -- we prove that upon random initialisation of weights, the
a priori probability P(t) that it represents a Boolean function that classifies
t points in {0,1}^n as 1 has a remarkably simple form: P(t) = 2^{-n} for
0 ≤ t < 2^n.
Since a perceptron can express far fewer Boolean functions with small or
large values of t (low entropy) than with intermediate values of t (high
entropy), there is, on average, a strong intrinsic a priori bias towards
individual functions with low entropy. Furthermore, within a class of functions
with fixed t, we often observe a further intrinsic bias towards functions of
lower complexity. Finally, we prove that, regardless of the distribution of
inputs, the bias towards low entropy becomes monotonically stronger upon adding
ReLU layers, and empirically show that increasing the variance of the bias term
has a similar effect.
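The headline result for the bias-free perceptron is easy to check numerically. The sketch below samples Gaussian weight vectors (one arbitrary choice of weight distribution), counts how many points of {0,1}^n each sampled perceptron classifies as 1, and compares the empirical frequencies of t with the predicted value 2^{-n}:

    import itertools
    import numpy as np

    rng = np.random.default_rng(0)
    n = 4                                           # 2^n = 16 input points
    X = np.array(list(itertools.product([0, 1], repeat=n)), dtype=float)

    # Sample perceptrons w ~ N(0, I) with no bias term; classify x as 1 iff w.x > 0.
    samples = 200000
    W = rng.normal(size=(samples, n))
    t = (X @ W.T > 0).sum(axis=0)                   # t for each sampled perceptron

    # Predicted: P(t) = 2^{-n} = 0.0625 for every t in 0, ..., 2^n - 1
    # (t = 2^n never occurs, since the all-zeros input is always classified as 0).
    freq = np.bincount(t, minlength=2 ** n + 1) / samples
    print(np.round(freq, 4))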